Network growth models and genetic regulatory networks 
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We study a class of growth algorithms for directed graphs that are candidate models for the evo- 
lution of genetic regulatory networks. The algorithms involve partial duplication of nodes and their 
links, together with innovation of new links, allowing for the possibility that input and output links 
from a newly created node may have different probabilities of survival. We find some counterintu- 
itive trends as parameters are varied, including the broadening of indegree distribution when the 
probability for retaining input links is decreased. We also find that both the scaling of transcription 
factors with genome size and the measured degree distributions for genes in yeast can be reproduced 
by the growth algorithm if and only if a special seed is used to initiate the process. 
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I. INTRODUCTION 

The manufacturing by cells of the proteins necessary 
for sustaining life is accomplished with the aid of molecu- 
lar machinery that translates DNA nucleotide sequences, 
or genes, into their specified proteins at appropriate 
times. The first step in making a protein is the tran- 
scription process, in which the needed piece of RNA is 
formed from the relevant gene. The rate of transcrip- 
tion of a given gene can often be enhanced or suppressed 
by the presence of proteins that bind to the DNA near 
the gene, the gene's promoter region. Once transcription 
has taken place, the rate at which a given protein is pro- 
duced may be further affected by the presence or absence 
of other proteins in the cell or other factors in the chem- 
ical environment. One way to approach the study of this 
complicated system of interacting molecules is to think of 
each gene, together with its promoter region, as an agent 
that interacts with other genes via protein-mediated in- 
teractions. 

The logical structure of systems of many agents that 
exert causal influences on each other may be represented 
as a graph in which nodes represent agents and directed 
edges indicate the presence of a causal influence of one 
agent on another. In modeling the logic of the cell, for 
example, we may think of each node as representing a 
gene and each directed link as indicating that the con- 
centration of the protein produced by one gene has some 
effect on the production rate of the other gene's protein. 
More precisely, a node represents a gene together with its 
promoter region. An incoming link indicates that under 
some circumstances a given protein can affect the tran- 
scription rate of a gene. The full set of genes (with pro- 
moter regions) and interactions forms a graph that may 
be called the genetic regulatory network. The dynam- 
ics of the network is determined by parameters (reaction 
rates) associated with the links and by a function for each 
gene that determines the rate of protein production as a 
function of the concentrations of its regulators. 

Genetic regulatory networks have, at least along some 
branches of evolution, grown over time from a relatively 
small ancestral genome to a vastly complex network of 



over 20,000 genes. We would like to know which fea- 
tures of real genetic regulatory networks are reflections 
of simple physical or mathematical laws as opposed to 
finely tuned solutions of specific problems faced during 
the evolutionary process. As a first step, we study a 
class of simple models of the growth process to see which 
network features arise purely from probabilistic effects 
when selection plays a minimal role. After discussing 
the general trends associated with the variation of cer- 
tain parameters in the model, we compare the structures 
generated by the model to biological data and highlight 
some problematic issues in the interpretation of the data. 

There is substantial evidence for the hypothesis that 
network growth occurs primarily via the duplication of 
genes and subsequent mutations of one or both members 
of the duplicate pair. Under such mutations (or im- 
perfect copying) the input and output links to a gene 
may not all be preserved as both the promoter region 
and the protein produced are altered. In addition, there 
is the possibility that a mutation in a gene will cause its 
protein to bind to a new promoter region or protein com- 
plex and thereby form a link that could not have been 
inherited during duplication alone. Similarly, a new link 
could be formed due to the mutation of a promoter re- 
gion to a configuration that now binds a new protein. 
We refer to links that are generated by mutation as "in- 
novated" links; those created via duplication are called 
"inherited." Studies of yeast and E. coli genomes suggest 
that innovation may account for as much as 50% of the 
links in the regulatory network. Q 

In modeling the growth of a genetic regulatory net- 
work one must incorporate some assumptions about the 
effect of natural selection. We make several simplifying 
assumptions concerning separations of time scales. As 
we are interested here in the evolution of the network 
structure that occurs over many generations, we ignore 
the time scale corresponding to cellular processes and 
the lifetime of individual organisms. Our evolutionary 
model assumes three additional scales. First, there is the 
typical time required for a duplicated gene to drift via 
mutation to a new stable gene. Call this the "mutation" 
time scale. Second, we assume a much longer time scale 
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required for the occurrence of duplication events. That 
is, we assume that after a duplication event whatever mu- 
tation is going to occur in the duplicated gene happens 
before any other duplication event occurs. Finally, once 
a gene has mutated a certain amount and thereby found 
a niche for itself in the cell, natural selection is assumed 
to keep it stable over time scales long compared to the 
duplication time scale. Though this is somewhat of a 
caricature of evolutionary processes at the genetic level, 
it has the virtue of conceptual clarity. We note that this 
model applies only to duplication events that ultimately 
lead to an increase in the size of the genome. It does not 
attempt to model duplications that lead to adaptive ra- 
diation and eventual selection of a single duplicated gene 
as the fittest In- 
consistent with the above assumptions, our networks 
grow via duplication/mutation events. When a gene G 
(together with its promoter region) is duplicated to create 
G' , it is assumed that G remains fixed while G' mutates 
(or that a portion of it is not copied faithfully) so that 
some of the duplicated input and output links at G' cease 
to function. The parameters in our growth model are the 
probabilities of retaining inherited links after mutation 
and the probabilities of innovating links. Note that the 
binding of a protein to a promoter region of DNA breaks 
the symmetry between input and output inheritance. In 
the output case, the issue is whether a mutation causes 
changes in a protein that significantly decrease its bind- 
ing affinity to unchanged portions of DNA (or perhaps 
to other proteins). In the input case, the issue is whether 
the mutation in the promoter region decreases the bind- 
ing affinity of a protein that has not changed. 

We consider three classes of models: (1) partial dupli- 
cation, in which only a subset of links is inherited dur- 
ing a growth event; (2) partial duplication with constant 
probability innovation, in which innovation probabilities 
are independent of the characteristics of the candidate 
nodes to be linked; and (3) partial duplication with "rich- 
gets-richer" innovation, in which nodes with more in- 
puts have higher probabilities of forming innovative input 
links. The set of models we study includes as special cases 
several models studied previously. |i H H E. M. HITf! fTT| 
In the present work we emphasize the importance of dif- 
ferent probabilities for input and output inheritance and 
focus on features of finite networks rather than scaling 
laws for arbitrarily large networks. 

For each case we study numerically the statistical fea- 
tures of networks of up to 2000 nodes grown from seeds 
with 10 nodes or, for reasons that will become clear later, 
100 nodes. In several cases we provide theoretical cal- 
culations supporting the numerical results. We find that 
certain choices of parameters result in networks that have 
input and output degree distributions similar to those re- 
ported for yeast cells and simultaneously match results 
on the scaling of the numbers of transcription factors 
with genome size. Our results suggest that a simple pro- 
cess that totally neglects any specific information about 
the biological function of individual genes can produce a 



realistic network, but only if the process starts from an 
appropriate seed. 

Before turning to the details of the growth algorithm, 
we wish to emphasize the importance of considering the 
precise meaning of a link in the network, since this has 
direct implications for the comparison to experimental 
data. A link between genes could be taken to mean that 
the protein product of a gene is a transcription factor 
that binds directly to a gene's promoter region. This in- 
terpretation allows the network to be determined with 
relatively straightforward experiments that test for the 
binding of a given protein to a given promoter region. 
(See, for example, It is also possible, however, 

for proteins that fail to bind directly to DNA to assert 
regulatory control through the formation of protein com- 
plexes at a promoter region or even through participation 
in chemical processes that occur far from the DNA. For 
purposes of simulating genome-wide transcriptional dy- 
namics, all causal relationships between the expression 
levels of two genes should be represented as links in the 
network. The difference between model parameters ger- 
mane to these two interpretations is discussed in detail 
below. 



II. GROWTH OF DIRECTED NETWORKS 
THROUGH PARTIAL DUPLICATION 

We make the following definitions: 

a family of nodes is the set of all nodes arising through 
a chain of duplications of a single ancestor in the 
seed; 

a constitutive node is a regulator that has no inputs. 

a regulator is any node that has at least one output; 

an inert node is a the result of a duplication event in 
which all inputs and outputs are deleted (a consti- 
tutive non- regulator); 

a transcription factor is a regulator that has at least 
one output that has been inherited, perhaps 
through several generations, from a seed regulator 
that has been designated a transcription factor. 

In biological terms, a constitutive node represents a gene 
whose expression level never changes or else changes only 
in response to external environmental variables. An inert 
node may represent either a nonfunctional bit of (junk) 
DNA or a gene that responds to environmental factors 
but remains completely independent of the activity of 
any other genes. A transcription factor represents a gene 
whose protein binds to DNA, and it is assumed that mu- 
tations never create new transcription factors from other 
types of regulators. 

The partial duplication model is implemented accord- 
ing to the following procedure. We define a time step to 
be the time between duplication events. At each time 
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step a gene G is chosen at random from the network and 
duplicated, forming a gene G' that has all the same input 
and output links as G. One of these identical nodes, say 
G' , is then assumed to mutate. Each input link inherited 
from G is independently tested and kept with probability 
Ci, and each output link with probability c . If G' should 
lose all of its inputs and outputs, it is considered to have 
lost all its function, and thus is removed from the net- 
work entirely. As mentioned above, there is no physical 
symmetry requiring Cj = c a . 

We can gain some intuition about the effects of the 
parameters by studying some limiting cases that permit 
analytical solutions. The simplest of these is Cj = c Q = 1, 
the case in which all links are kept after every duplica- 



Ni(t + l,k) 
N {t + l,k) 



The sum on the right hand side of the first equation rep- 
resents probability of the addition of a node with k in- 
puts due to the duplication and subsequent mutation of 
a node with k' inputs. The c (k — l)A^(i, fc — l)/t term 
is the probability that the duplicated node is one of the 
(k — 1) inputs to a node K and the inherited output is 
kept, so that the number of inputs to K is incremented 
to k. The last term comes from the possibility of adding 
a new input to a node that already has k inputs. 

In the limit of large t, the sums in can be approxi- 
mated by 

J2 ( k ')c k (l-cfN(t,k')^-N(t,k/c) 7 (3) 

k'=k ^ ' C 

where we have assumed that N(t,k') does not vary signif- 
icantly over the range of k' values that have an apprecia- 
ble probability to yield k inputs after mutation [j]. The 
factor of 1/c comes from the fact that there are 1 /c values 
of k' for which \ck' \ = k. This approximation allows for 
rapid iteration of the master equation, enabling numer- 
ical studies of distributions for large t. (For analytical 
results on master equations of this type, see @.) 

For Ci = c = c the two master equations become 
identical and the asymptotic forms of the input and out- 
put distributions are the same. For 0.2 < c < 0.7, the 
approximate master equation gives an accurate estimate 
of N(t, k) for network sizes t > 100, as shown in Fig- 
ure n F° r c outside this range, however, the degree dis- 
tribution predicted by iterating the approximate equa- 
tion only matches simulation data for large values of k, 
as expected. 



tion event. In this case all nodes in the same family have 
exactly the same set of inputs and outputs. The degree 
distributions will consist of delta functions whose posi- 
tions depend on the seed network chosen and the num- 
ber of duplication events that have occurred within each 
family. 

For Ci and c less than unity, a master equation de- 
scribes the evolution of the degree distributions during 
growth. Q Let t represent the total number of nodes in 
the network. Advancing t by one corresponds to a single 
duplication event. Let Ni(t, k) and N a (t, k) represent the 
number of nodes at "time" t having k inputs and outputs, 
respectively. On average, we have 
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(2) 



As c approaches zero, the duplication/mutation pro- 
cess becomes essentially equivalent to the preferential at- 
tachment growth algorithm 6] that is known to produce 
networks with scale free degree distributions. The only 
difference is that the duplication process produces a large 
number of inert nodes, whereas preferential attachment 
counts only those nodes that do get linked to the existing 
network. For sufficiently small c, the probability of pro- 
ducing a node with more than one input or output via 
partial duplication is negligible (on the order of c 2 ), so 
each new node (that is not inert) is added to the network 
with a single input or output. The probability of forming 
a new input to node G is proportional to the probability 
of selecting an input node to G for duplication, which in 
turn is proportional to the number of inputs that G al- 
ready has. For c — > 0, ignoring the production of linkless 
(inert)nodes leads to: 

i\T(f+l, k) = N(t, k)+S(k-l)+-[(k-l)N(t, k-l)-kN(t, k)}. 

The total number of links in the connected part of the 
network approaches t as c gets small, so this master equa- 
tion is equivalent to the one given for preferential at- 
tachment by Barabasi et al., which is known to yield a 
scale-free distribution N(t, k) — fc~ 7 with 7 = 3. 0,13 

For c = 0.5, our model is equivalent to one proposed 
by Dorogovtsev et al, who predicted scale free behavior 
in the degree distribution function, with the frequency of 
occurrence of indegree (or outdegree) k decaying like k^. 
|4| This scale free behavior should occur, however, only 
over the domain of very large t and k. The systems we 



Ni(t, k) + - J2 (YW " C ^' N ^ k ') + ji( k ~ i)^- * - 1) - kNi(t, k)}. 

k' = k ^ ' 

N (t, k) + \j2 (fc) C ° (1 - c of N o(t, k') + j[(k- l)N (t, k - 1) - kN (t, k)}. 
k'=k ^ ' 
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FIG. 1: Simulated (circles) and predicted (squares) degree 
distribution functions for different values of c. (a) c = 0.1; 
(b) c = 0.3; (c) c = 0.5; and (d) c = 0.7. For models with no 
innovation and a = c the indegree and outdegree distribu- 
tions are identical. 



FIG. 2: Indegree (circles) and outdegree (squares) distribu- 
tion functions for c = 0.5 and different values of c^: (a) 
d = 0.1; (b) a = 0.3; (c) a = 0.5; and (d) a = 0.7. Each 
data point represents the average number of nodes with a 
given number of inputs in an ensemble of 100 networks. 



have studied are not large enough to show the predicted 
scaling, as we can see from the (c) panel in Figure Q 
References [jj 0, fj| also provide analytical results on 
the asymptotic scaling properties for small c and failure 
of self-averaging for large c in similar models. 



A. Effects of different values for a and c 

Holding c Q constant at 0.5, it is interesting to see what 
happens as Cj is varied. To compare the distributions for 
different parameter values we choose to keep the total 
number of (non-inert) nodes fixed at 1000. That is, we 
simulate the network growth (not the approximate mas- 
ter equations) from a seed of 10 nodes with 10 randomly 
assigned connections, discarding inert nodes and stop- 
ping when the network contains 1000 nodes. The data 
shown in Fig. [2] are averages taken over 100 networks. 

Over the range of sizes probed, the indegree distribu- 
tions are roughly scale-free for smaller values of Ci as seen 



in Fig.|2Ja). We note that the scaling exponent suggested 
by these data is approximately 7=1, but the systems 
are not large enough to be described by the asymptotic 
scaling regime of the relevant master equation, so we do 
not expect this scaling to persist to substantially larger 
system sizes. A surprising result is that decreasing a 
actually tends to broaden the tail of the indegree distri- 
bution. One may have expected a shift of the distribu- 
tion favoring smaller indegrees because the probability of 
keeping a large number of inputs in a given duplication 
event decreases with decreasing c^. There is a competing 
effect, however, associated with the increased production 
of inert nodes, which are not counted as part of the 1000 
nodes that constitute the final network. Because nodes 
with small indegrees tend to produce inert nodes, the 
probability of a node being duplicated is skewed toward 
nodes with high indegrees. In the limit of very small 
the situation is similar to the preferential attachment 
limit discussed above, where nodes with more inputs are 
more likely to receive new inputs when another node is 
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duplicated. The difference here is that the newly cre- 
ated node may have many outputs rather than the fixed 
number stipulated in standard preferential attachment 
models. 

Note that Fig. |2c) is not identical to Fig. ^c), even 
though they correspond to the same parameter values. 
This is because of the difference in the criteria used for de- 
termining the sizes of the networks in the ensemble. For 
purposes of comparing distributions to those predicted by 
the master equation, we must include inert nodes in the 
simulation. That is, inert nodes are counted as contribut- 
ing to the system size. For the purpose of examining an 
ensemble of networks with a given number of genes, how- 
ever, we do not count the inert nodes. Interpreted in this 
context, the data shown in Fig.^c) correspond to an en- 
semble with a distribution of networks sizes, all smaller 
than 1000. 



III. EFFECTS OF INNOVATION 

The pure partial duplication model does not contain a 
mechanism for innovation of new links. We now consider 
two models for representing the general effects of innova- 
tion. The first model is similar to that one introduced by 
Sole 0): each time a new node is created, every possible 
input and output link it might form is tested and kept 
with independent probability p. In the second model, the 
probability of keeping a tested link is assumed to depend 
on the indegree of the node at the input side of the link; 
nodes with more inputs are assumed to have a higher 
probability of accepting new inputs. The motivation for 
the latter comes from the ability of proteins to form com- 
plexes or interact in ways that affect transcription, which 
suggests that if more proteins participate in the regula- 
tion of a given gene, there are more opportunities for a 
new protein to exert a regulatory influence. Note that 
this is not true on the output side; the probability that 
a given protein will participate in a particular regulatory 
link should not change just because that protein begins 
to participate in additional links. 



A. Constant probability innovation 

Innovation is added to the partial duplication model 
as follows. At each time step a gene G' is produced by 
partial duplication. G' is then given a chance to develop 
an innovative input from all of the transcription factors 
in the network. Note that in this model all regulators are 
transcription factors. It is impossible for a node with no 
outputs to become a regulator. If gene G' has inherited 
outputs (and thus is a transcription factor), then it is 
given a chance p to bind to every node in the network. If 
G' has no inputs or outputs after all inherited and inno- 
vated links are tested, it becomes inert and is assumed to 
remain inert forever; it can no longer receive innovated 
inputs. 
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FIG. 3: Indegree (circles) and outdegree (squares) distribu- 
tion functions for c = 0.5 and a = 0.1, with various values 
of p: (a) p = .001; (b) p = .005; (c) p = .01. Each different 
value of p results in a different percentage of innovated links: 
(a)39.1% (b) 59.8% (c)64.6%. 



For the present study, networks were grown until the 
number of non-inert nodes reached N = 1000. Figs. |3| 
and 01 show the indegree and outdegree distributions for 
c G = 0.5 and c; = 0.1 and 0.5, respectively, for several 
values of p. 

Comparing Figs. and 0] to Fig. [3 we see that in- 
novation serves to shift the weight in the outdegree dis- 
tribution to larger values of k and produces a peak at 
nonzero k. This qualitatively matches the yeast data of 
Lee et al., who find that the output distribution has a 
peak at a value near 3Q.|l2fl However, innovation also re- 
duces the weight in the tail of the indegree distribution. 
This somewhat surprising result is due to the fact that 
innovation decreases the number of nodes with only a 
single input, thereby decreasing the rate at which du- 
plications yield inert nodes, which in turn decreases the 
number of duplications of high indegree nodes. 



B. Rich-gets-richer innovation 

As mentioned above, links corresponding to the direct 
bindings of transcription factors to DNA do not exhaust 
the possible sources of regulatory control. To model the 
networks observed in experiments that detect the influ- 
ence of protein interactions and other regulatory effects, 
one must consider models in which non-regulatory nodes 
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FIG. 4: Indegree (circles) and outdegree (squares) distribu- 
tion functions for c = 0.5 and d = 0.5, with various values 
of p: (a) p = .001; (b) p = .005; (c) p = .01. Each different 
value of p results in a different percentage of innovated links: 
(a) 16.2% (b) 36.5% (c)43.2%. 



can develop outputs via innovation. If protein interac- 
tions are an important source of regulatory control, we 
might expect that any gene could possibly innovate an 
output to any other gene, and that genes with more in- 
puts would be more likely to gain further inputs. This 
"rich-gets-richer" effect may produce scale free indegree 
distributions, as in simpler preferential attachment mod- 
els. 

One might imagine many different ways in which the 
probability of receiving an innovated input could depend 
on the indegree of a node. The situation of interest, 
though, is one in which the interactions between nodes 
are determined by physical properties of molecules and 
therefore should not depend on the order in which nodes 
were added to the system. That is, the probability that a 
link from G to H is innovated may depend on the number 
of inputs to H, but should not depend on whether those 
inputs were generated before or after G was created. This 
constraint restricts the class of models considerably, as 
described below. 

The model is implemented according to the following 
procedure. At each time step, first a gene G is chosen at 
random from the network and partial duplication occurs 
with probabilities Ci and c Q as above. Each gene in the 
network is then tested to see if it innovates an input to 
G' and/or receives an innovative input from G'. When- 
ever a new gene G' is created, it is given a probability 



p(n) of having an output to each gene H, where n is the 
number of inputs H already has. (Note that p(0) does 
not vanish. There is some probability that the new gene 
will regulate a gene that previously had no inputs.) In 
addition, any time that any gene J acquires one or more 
new inputs, whether via duplication or innovation, all 
other genes K in the network are given probability q(A) 
of forming a new input link to J, where A is the number 
of inputs that J has gained since the last time the forma- 
tion of an innovative link from K to J was tested. This 
latter process is iterated until a complete pass through 
the network generates no new links. Multiple time steps 
are performed until the network contains N genes with 
at least one input or output link. 

To ensure that the probabilities of links being present 
are independent of the order in which new innovations 
are attempted, we must choose 



p{n) 
9(A) 



I _ g-(l+n)/«o. 
1 _ p-A/no 



(5) 
(6) 



where 1 /no gives the probability that a gene with no in- 
puts will obtain an input from any particular gene. The 
exponential form of q(A), means that [1 — q(Ai)][l — 
g(A 2 )] = [1 - g(Ai + A 2 )] for any A x and A 2 . This en- 
sures that the probability of there being no link between 
two genes is independent of the number of times the po- 
tential link was tested during the growth process. Thus 
our model is a self-consistent growth algorithm in which 
the probability of adding a new link can always be deter- 
mined without worrying about the order in which links 
are tested and added to the system. 

The amount of innovation and the values of Ci and 
c are roughly similar between Fig. 0(a) and Fig. 0(a), 
Fig. El (b) and Fig. 0(c). These pairs of figures show that 
rich-gets-richer innovation broadens the indegree distri- 
bution compared to constant probability innovation, as 
expected when high indegree nodes are favored for inno- 
vated inputs. We also note the general trend that rich- 
gets-richer innovation gives rise to a less pronounced peak 
in the outdegree distribution and to clearer power law 
tails. 

The differences between constant probability innova- 
tion and rich-gets-richer innovation may be relevant for 
the modeling of biological data. At the very least, these 
differences highlight the importance of obtaining a clear 
understanding of the types of regulatory interactions that 
are included in experimental reports on the structure of 
genetic regulatory networks. 



C. Selecting a starting seed 

The discussion and results above focused on generic 
behaviors expected when a network has grown to many 
times the size of the initial seed. Unfortunately, it ap- 
pears that no choices of the parameters c,, c Q , and p, 
can reproduce degree distributions qualitatively similar 
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FIG. 5: Indegree (circles) and outdegree (squares) distribu- 
tion functions for c = 0.5 and different values of cc (a) 
d = 0.1; (b) a = 0.3; (c) a = 0.5; and (d) c, = 0.7, with 
no = 3000 in the rich-gets-richer innovation model. The per- 
centage of innovated links is (a) 48.3%; (b) 41.4%; (c) 37.5%; 
(d) 35.3%. 



to those reported for yeast in experiments measuring 
transcription factor binding. In those experiments, it is 
found that the output distribution is extremely broad, in- 
cluding an appreciable number of nodes with more than 
130 outputs, while the input distribution remains quite 
narrow, contai ning very few nodes with more than about 
10 inputs. [H[ll 

There is, however, another piece of biological evidence 
that suggests that real genetic regulatory networks can- 
not be in the asymptotic large TV regime. Nimwegen has 
observed that the number of transcription factors in an 
organism scales like N a , with alpha w 1.26 for eukary- 
otes and a « 1.87 for bacteria, where N is the size of the 
genome. ^EJ A scaling law of this type with a greater 
than unity is impossible for arbitrarily large genome sizes 
since the number of transcription factors cannot exceed 
the total number of genes. We are thus led to consider 
models in which the number of transcription factors is ini- 
tially quite small compared to the total number of nodes 
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FIG. 6: Growth rate of the number of transcription factors 
for the parameters d — 0.2 , c = 0.5, with (a) p = and 
(b) p — 0.005. The networks are grown from a seed of one 
regulator linked to 99 other nodes. Solid lines show the scaling 
laws consistent with Nimwegen's analysis of (a) bacteria and 
(b) eukaryotes. 



in the network. 

We study networks of 2000 nodes grown from a seed 
consisting of a single node that regulates itself and 99 
other non-regulating nodes. This is roughly consistent 
with an extrapolation from Nimwegen's observations to 
genomes with only 100 genes. 

Since we are now considering data that identify tran- 
scription factors rather than all regulatory interactions, 
we work with the constant probability innovation model. 
As seen in Fig.El the number of transcription factors does 
indeed grow roughly as a power law, with an exponent 
that depends on the value of c and p. The straight lines 
shown on the plots indicate scaling exponents roughly 
consistent with Nimwegen's reported values. (No at- 
tempt was made to search parameter space for optimal 
fits to Nimwegen's exponent for bacteria.) 

In Fig. EI one can see evidence for a region of quadratic 
scaling for very small system sizes, implying that when 
a gene is selected for duplication, the result is twice as 
likely to be kept in t he g enome if it is a transcription fac- 
tor than if it is not. We can understand this effect in 
the context of our model as follows. If c is sufficiently 
large and transcription factors have many outputs, the 
chance that partial duplication of a transcription factor 
will result in an inert node is negligible. On the other 
hand, most genes have rather few inputs, so the proba- 
bility of duplication of a non-transcription factor leading 
to an inert node is appreciable and is also sensitive to 
Cj. Since duplications of transcription factors increase 
the number of inputs to many of the nodes, exact calcu- 
lation of the rate at which inert nodes are generated is 
difficult. We find numerically that a ~ 0.2 leads to an 
initial growth phase with roughly quadratic scaling. 

It is interesting to compare the degree distributions 
obtained from the model with parameters that yield a 
scaling exponent consistent with Nimwegen's analysis of 
eukaryotes to the distributions reported for yeast. 01 
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FIG. 7: (a) Indegree (circles) and outdegree (squares) distri- 
bution functions for the model parameters equal to those of 
Fig.02»: c = 0.5, a = 0.2 and p = 0.005. (b) Same as (a) 
except a = 0.3. The networks are grown from a seed of one 
regulator linked to itself and 99 other nodes. 

Fig.0sh.ows that we obtain an indegree distribution with 
no extended power-law tail together with an outdegree 
distribution with a broad power-law tail. The widths 
and shapes of the distributions are roughly consistent 
with those observed for yeast, including such features 
as the power-law outdegree tail with exponent near 2 
and a slower-than-exponential decay at small indegree 
that is rapidly cut off above a maximum near 30 in the 
outdegree distribution. For the model parameters used 
here, approximately 28% of the links in the system are 
formed via innovation. 

We note that Harbison et al. [Tij . using a different 
approach from Lee et al. 0], report data on outdegrees 
for 203 transcription factors that show a distribution ex- 
tending to approximately 300, but appearing to have an 
exponential form and no peak. In the Harbison data, 
however, outdegrees are determined as the union of re- 
sults of different preparations that show wide variations 
in the outdegrees of individual genes. A third source of 
data on the yeast network is available from Milo et al. 
[l3| Careful comparative analysis of all of the available 
data is beyond the scope of the present work. 

IV. CONCLUSIONS 

We have introduced and studied two classes of growth 
algorithms for networks with directed links. Both involve 
partial duplication of nodes as the sole growth mechanism 
and separate parameters determining the probabilities of 
keeping inputs and outputs of the duplicated node. In 
one case, the opportunities for the innovation of new 
links are given a constant probability and every possi- 
ble output link is tested exactly once during the growth. 
In the other, nodes with more input links are given a 
higher probability of receiving new inputs, and poten- 
tial input links are re-tested every time a node receives a 
new input. In the context of our simplified description of 



genome growth, the first model is appropriate for study- 
ing transcription factor binding networks, where output 
links correspond only to direct binding of a protein to 
DNA. The second model corresponds to the full network 
of regulatory interactions, in which some proteins that 
do not bind to DNA can still exert regulatory control. 

We have studied indegree and outdegree distributions 
and the relation between the number of transcription fac- 
tors and system size in networks of up to 2000 nodes 
grown from random seeds with 10 nodes or special seeds 
with 100 nodes. Some counterintuitive results concern- 
ing the effects of various parameters on the degree distri- 
butions were observed and explained. Parameters were 
found that produce networks with degree distributions 
similar to those reported for yeast cells and plausibly re- 
alistic scaling laws for the number of transcription factors 
as a function of genome size in eukaryotes. 

For our growth model, the production of realistic net- 
works requires a seed in which many nodes are regulated 
by a single transcription factor with a self-input. While 
we do not know the detailed regulatory network struc- 
ture of any ancestral organism with only 100 genes, it 
is plausible to suggest that early simple organisms re- 
quired relatively few regulatory genes that controlled a 
large number of structural genes. 

Several features of transcriptional network architecture 
would be natural candidates for further study. We have 
not yet analyzed the frequency of occurrence of small 
network motifs or clustering statistics. It is known that 
partial duplication induces strong local correlations in 
a model with identical probabilities for retaining input 
and output links. |7j We conjecture that minor modifi- 
cations to our model favoring innovation between genes 
that share a neighbor could account for higher clustering 
coefficients and favor local motifs without altering the 
overall degree distributions or transcription factor scal- 
ing laws. 

We have explored the behavior of a class of network 
growth models that may be relevant for understanding 
the structure of genetic regulatory networks. Our results 
indicate that several nontrivial features of presently avail- 
able biological data can arise simply from probabilistic 
growth rules with no notion of optimization due to se- 
lection coming into play. This suggests that statistical 
features of real genetic regulatory networks may be de- 
termined by physical or biochemical parameters rather 
than careful tuning through natural selection. 

This is not to say that natural selection plays no role in 
determining which networks survive and prosper. Among 
the ensemble of networks generated by our growth mod- 
els, there are many possibilities for variation. For ex- 
ample, the causal influences indicated by links in our 
networks, and hence the dynamical properties of the net- 
work, may evolve under selection pressures via small mu- 
tations that do not affect the network architecture. Nev- 
ertheless, it is interesting to see that simple probabilistic 
growth rules can account for an ensemble of possibilities 
available to the fitness selection process. 0] 
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